Abstract: Our paper presents solutions that can significantly
improve the delay performance of putting and retrieving data 
in and out of cloud storage. We first focus on measuring the 
delay performance of a very popular cloud storage service,
Amazon S3. We establish that there is significant randomness
in service times for reading and writing small and medium size 
objects when assigned distinct keys. We further demonstrate that 
using erasure coding, parallel connections to the storage cloud, and
limited chunking (i.e., dividing the object into a few smaller
objects) together pushes the envelope on service time distributions
significantly (e.g., 76%, 80%, and 85% reductions in mean,
90th, and 99th percentiles for 2 MB files) at the expense of
additional storage (e.g., 1.75x). However, chunking and erasure
coding increase the load and hence the queuing delays while 
reducing the supportable rate region in number of requests per 
second per node. Thus, in the second part of our paper we focus 
on analyzing the delay performance when chunking, FEC, and 
parallel connections are used together. Based on this analysis, we 
develop load adaptive algorithms that can pick the best code rate 
on a per request basis by using off-line computed queue backlog 
thresholds. The solutions work with homogeneous services with 
fixed object sizes, chunk sizes, operation type (e.g., read or write) 
as well as heterogeneous services with mixture of object sizes, 
chunk sizes, and operation types. We also present a simple greedy 
solution that opportunistically uses idle connections and picks 
the erasure coding rate accordingly on the fly. Both backlog and 
greedy solutions support the full rate region and provide best 
mean delay performance when compared to the best fixed coding 
rate policy. Our evaluations show that backlog based solutions 
achieve better delay performance at higher percentile values than 
the greedy solution. 

Index Terms: FEC, Cloud storage, Queueing, Delay



I. Introduction 

Public clouds are widely used by web services and
Internet applications. They provide a high degree of
availability, scalability, and data durability. Yet, there exists
significant skew in network-bound I/O performance,
necessitating solutions that provide robustness in a cost-effective
manner [1], [2]. In this paper, we focus on cloud storage
and present solutions that can provide much better delay
performance for putting files into cloud storage as well
as for retrieving them on demand. In particular, we base
our analysis on the Amazon S3 service, one of the most popular
cloud storage services.

A typical cloud storage system stores and retrieves objects via their
unique keys. Each object is replicated several times within the
cloud and sometimes also further protected by erasure codes to
use the storage capacity more efficiently while attaining very
high durability guarantees [3]. The storage provider also monitors
the load on each storage node and employs dynamic load
balancing to prevent hot storage nodes that might observe high
loads or slow nodes that have excessively high response times.
Although coded blocks are mainly used for repairing data in
unavailable storage nodes, some cloud providers also access them in
parallel to uncoded blocks when the uncoded blocks are stored in
slow nodes [3]. Despite all these mechanisms, evaluations
of large-scale systems still indicate that there is a high degree
of randomness in delay performance |fr|. Thus, the services 
that require better delay performance must deploy their own
solutions such as sending multiple requests (in parallel or
sequentially), chunking large objects into smaller ones and
reading/writing each chunk in parallel, replicating the same object
using multiple distinct keys, etc.

To this end, we conducted our own measurements on 
Amazon S3 for various object sizes to model its delay dis- 
tribution. Our measurement results confirm that the delay 
spread is significant even when object sizes are in the order of 
megabytes. Moreover, our study indicates that when the server 
accessing the storage cloud is not the bottleneck (in terms of 
CPU and network access speed), we can substantially improve 
the distribution of read/write delays. To achieve these gains, 
one has to consider not only chunking and parallel access to 
each chunk, but also erasure coding. In fact without erasure 
coding, more chunking starts hurting the performance at lower 
percentile values. The gains when forward error correction 
(FEC) is employed are significant in the average delay perfor- 
mance and they are much better at higher percentile delays. 

Nonetheless, the server accessing the storage cloud has limited
CPU and network access speed, which limits the number of
concurrent connections to the storage cloud without going into
a processor sharing mode. With limited system capacity, one
has to consider the load and its impact on queueing delays 
to quantify the total delay. Unfortunately, FEC and chunking 
create redundant load multiplying the arrival rate into the 
system. Unless mean service rate is improved to the same 
extent, the maximum rate at which end users can be served 
is reduced. Our observations over Amazon S3 indicate that 
indeed lower code rates reduce the supportable rate region 
inducing queue instability earlier than higher code rates. Thus, 
it is imperative to design a load adaptive strategy for changing 
FEC rates on the fly to keep total average delays at the 
minimum level while remaining in the achievable rate region 
of the uncoded system. 

To come up with meaningful solutions, one needs to analyze
the queuing delay for the system. As one of the main contributions
of the paper, we analyze the average delay performance of a
system that incorporates chunking, FEC, and multiple servers.
This system model is much harder than an M/G/k queue,
which itself has only crude approximations, as the service
times of servers become interdependent due to the use of
erasure coding. To make this point clearer, consider the
case where an object is divided into two parts and a third
part is generated by bit-wise XOR. If three servers are idle,
then each part can be accessed in parallel. As soon as any
two servers complete their jobs, the third server can preempt
its current job, as erasure coding renders the completion of this
job irrelevant. Except for a very recent work [4] that targets
a much simpler yet still hard case, to the best of our
knowledge queuing analysis for such a system model is quite
an uncharted area. Our analysis provides a good approximation
for capacity and mean delay for homogeneous traffic with one
operation type (e.g., all reads) and file size as well as for
heterogeneous traffic with a mixture of traffic types (e.g., both
read and write requests with varying chunking and file sizes).

As another major contribution, we develop three load adap- 
tive FEC schemes that change the coding rate on the fly. 
Using the analysis results, we can actually identify under 
what load regimes which fixed FEC strategy provides the best 
average delay performance leading to simple backlog threshold 
based adaptive algorithms. We present two schemes BAFEC 
(for single type of requests, i.e., homogeneous traffic) and 
MBAFEC (for multiple types of requests, i.e., heterogeneous 
traffic) that adapt FEC rates based on the queue backlog. 
Via simulations using real service time traces from Amazon 
S3, we show that both schemes are able to beat the delay 
performance of any fixed FEC rate policy while achieving 
the rate region of the uncoded strategy. Since both BAFEC 
and MBAFEC require a priori knowledge and put constraints 
on service time distribution of cloud storage to compute the 
optimal thresholds, we also propose a greedy strategy that 
opportunistically determines FEC rates based on the number 
of idle servers at the time of request arrivals. Trace driven 
simulations demonstrate that the greedy strategy performs on 
a par with the queue backlog based strategies in terms of 
total mean delay. Nonetheless, the greedy method performs 
significantly worse in some cases at very high percentile values 
(e.g., at 99.9th percentile). 

The remaining sections are organized as follows. In
Section II we explain our system model in more detail. In
Section III we present our measurement results over Amazon
EC2 and S3. In Section IV we study the single-class scenario
and develop an FEC rate adaptive scheme, BAFEC, based on
the analysis, and evaluate its performance through
trace-driven simulations. In Section V we generalize the analysis
to the multi-class scenario and develop a multi-class FEC rate
adaptive scheme, MBAFEC. In Section VI we cover the related
literature. Finally, we conclude the paper in Section VII.

II. System Model 

A. Basic Architecture and Functionality 

The basic system architecture captures how web services 
today utilize public or private storage clouds. The architecture 
consists of proxy servers in the front end and a key-value store 
(referred to as cloud storage) in the backend. 

Proxy servers have two main responsibilities: (1) Present 
a rich service layer that operates on top of the raw cloud 
storage services/interfaces. (2) Optimize the user perceived 
performance. Client requests arrive at any of the proxy servers. 
When a client wants to upload a file, the proxy server divides the file
into one or more chunks. Each chunk is stored as an individual
object with a unique key in the key-value store. When the
entire file is written successfully, the job is completed and
a response is sent back to the client. When a client wants
to download a file, the proxy server checks which chunks need
to be fetched from the storage cloud. The proxy generates read
requests for these chunks and, after receiving the complete set
of chunks, the job is completed and the file is streamed back
to the client. The solutions we present are deployed on the
proxy server side, transparent to the cloud storage.

Cloud storage has two main purposes; (1) Provide data 
storage with high durability and availability. (2) Provide on 
demand scaling of storage needs. Cloud storage does not inter- 
pret the objects it stores, but rather treats them as byte strings 
with a well-defined length. For high durability and availability, 
typical cloud systems replicate each object several times in 
different physical locations and may use FEC internally. From 
proxy servers' perspective, cloud storage is a black box whose 
internal techniques are unknown. Proxy servers only know the
response time of each query (e.g., putting, getting, copying,
deleting objects) they send to the cloud storage.

B. Adding FEC Support in Multi-threaded Proxies 

In our design, we employ maximum distance separable
(MDS) codes [5]. Suppose a file is divided into k equal
size chunks (with padding). An (n, k) MDS code (e.g.,
Reed-Solomon codes) can expand these k original chunks into n > k
coded chunks such that any k chunks out of n are sufficient to
efficiently restore the k original chunks (hence the file itself).
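
As a minimal illustration of the "any k out of n" property, the sketch below implements the smallest non-trivial case, a (3, 2) single-parity code built from bit-wise XOR. It is not the code used in the paper's experiments; the function names `encode_3_2` and `decode_3_2` and the zero-padding convention are assumptions made for this example.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_3_2(data: bytes) -> list[bytes]:
    """Split data into k = 2 equal chunks (zero-padded) and add one XOR parity chunk."""
    half = (len(data) + 1) // 2
    c0 = data[:half]
    c1 = data[half:].ljust(half, b"\x00")  # pad the second chunk to equal size
    return [c0, c1, xor_bytes(c0, c1)]


def decode_3_2(chunks: dict[int, bytes], length: int) -> bytes:
    """Recover the original data from any 2 of the 3 chunks (indexed 0, 1, 2)."""
    if len(chunks) < 2:
        raise ValueError("need at least k = 2 chunks")
    if 0 in chunks and 1 in chunks:
        c0, c1 = chunks[0], chunks[1]
    elif 0 in chunks:
        c0 = chunks[0]
        c1 = xor_bytes(c0, chunks[2])  # parity XOR c0 gives back c1
    else:
        c1 = chunks[1]
        c0 = xor_bytes(c1, chunks[2])  # parity XOR c1 gives back c0
    return (c0 + c1)[:length]  # drop the padding
```

The original length is passed to the decoder so that padding can be stripped; a real deployment would have to store that length alongside the chunks.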

MDS codes can help reduce the read delays as follows.
Suppose the proxy node has already segmented the requested
file into k chunks, expanded them into $n_{\max}$ chunks using an
$(n_{\max}, k)$ MDS code, and written each chunk as a separate
object using a unique key into the storage cloud. When the
file is to be read, the proxy schedules n read tasks for distinct
chunks using n threads (not necessarily distinct ones) such
that $k \le n \le n_{\max}$. The earliest k successful responses from the
storage cloud are then sufficient to complete the read
operation, as k chunks can be decoded to the original file chunks
without requiring the remaining chunks (thus the read tasks
for those chunks can be cancelled). Notice that we implicitly
assumed parallel independent task handling. If the tasks cannot
be served in parallel or have strong correlation in their service
latencies, FEC would impede the delay performance due to
the extra load and processing overheads it generates.
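
The read path described above can be sketched with Python's standard thread pool. This is only an illustration under stated assumptions: `fake_read` is a hypothetical stand-in for a per-chunk request to the storage cloud, and `first_k_of_n` is a name chosen for this example, not an interface from the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_k_of_n(tasks, k, max_threads):
    """Run all tasks in parallel and return the results of the k earliest finishers.

    `tasks` is a list of zero-argument callables, one per coded chunk.
    """
    executor = ThreadPoolExecutor(max_workers=max_threads)
    futures = [executor.submit(task) for task in tasks]
    results = []
    try:
        for future in as_completed(futures):
            results.append(future.result())
            if len(results) == k:
                break
    finally:
        # Cancel tasks still waiting for a thread. Python threads cannot be
        # preempted, so tasks that are already running finish in the background.
        for future in futures:
            future.cancel()
        executor.shutdown(wait=False)
    return results


def fake_read(chunk_id, delay):
    """Hypothetical chunk read that sleeps for `delay` seconds, then returns its id."""
    def task():
        time.sleep(delay)
        return chunk_id
    return task
```

For example, `first_k_of_n([fake_read(0, 0.5), fake_read(1, 0.01), fake_read(2, 0.05)], k=2, max_threads=3)` returns as soon as chunks 1 and 2 arrive, without waiting for the slow chunk 0.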

Write operations are supported in a similar vein. Proxy can 
divide the file into k chunks of equal size and encode them 
into n coded chunks. The proxy then creates n write tasks, one 
for each coded chunk. It schedules the tasks using n threads. 
As soon as any k of the n uploading tasks complete, sufficient 
data has been stored on the cloud storage system. Thus, 
upon receiving k successful responses from the storage cloud, 
the write job completes and the proxy responds back with 
a success response. The outstanding tasks can be cancelled, 
preempted, or scheduled as background jobs depending on the 
subsequent read profile on the same file. If the file is written
for archival purposes and is very rarely read, then the proxy node
can decide to keep only k chunks in the cloud storage. If the
file is to be read often enough, then eventually $n_{\max}$ chunks
need to be written. Furthermore, there is a potential to use
rateless codes with no pre-specified $n_{\max}$ for write operations,
as the unused chunks can be deleted immediately after the
completion of the write operation.

We assume that the set of objects to be read and the set of
objects to be written are disjoint. In other words, we assume
there are no subsequent read requests for an object that has been
written. While it would be nice to model the practical case
in which subsequent read and write requests for the same object are
allowed, we believe it would significantly complicate the already
intricate problem and would not provide much additional insight
for the purpose of scheduling policy design.

C. Queueing Model with Multiple Threads and Coding

Due to shared resources, the level of parallelism achievable 
by using multiple threads is limited: the system can only sup- 
port a finite number of simultaneously active threads without 
significantly degrading the performance of each individual 
active thread. Thus, we denote the maximum number of 
simultaneously active threads allowed in our system as L. 
Under this constraint, we assume that the performance of each 
individual active thread is independent of the total number of 
active threads during the span of its life time. 



Accordingly, we model our proxy system by the queueing
system shown in Fig. 1. There are two FIFO (first-in-first-out)
queues in the system: one request queue that buffers all
incoming requests that have not started yet, and one task queue
that holds all waiting tasks of requests being served. L threads
are attached to the task queue. Whenever a thread becomes
idle, it immediately starts serving the head-of-line (HoL) task
in the task queue. The scheduler monitors the state of the 
queues and the threads, and decides what code rate should 
be used for each request in the request queue. The scheduler 
instructs the dispatcher to remove the HoL request from the 
request queue only if there is at least one idle thread. The 
dispatcher then creates the tasks for this request according to 
the code rate chosen by the scheduler, and injects them into the 
task queue. The idle threads immediately start serving (some 
of) the newly injected tasks. At the time when a request is 
completed, if some of its tasks are waiting, the waiting tasks 
are removed from the task queue. For a completed request, if 
some of its tasks are still being served, they are cancelled and 
the threads serving them become idle. 

Depending on the criteria according to which the HoL 
request of the request queue should be admitted into the 
task queue, scheduling policies can be classified into the two 
categories below. Here, we assume that the scheduler has 
decided to serve the HoL request with an (n, k) code. 

• Blocking: The HoL request is admitted into the task 
queue if and only if there are at least n idle threads. 

• Non-blocking: The HoL request is admitted into the task 
queue if and only if there is at least 1 idle thread. 

Blocking policies are not work conserving and thus waste
system capacity by keeping threads idle unnecessarily. However,
they have a nice structure that facilitates tractable queueing analysis
and provide a good approximation for non-blocking policies,
which are work conserving but quite difficult to analyze.
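
The two admission rules can be written down directly. The sketch below is illustrative only; the function names and the string labels for the policies are assumptions made for this example.

```python
def can_admit(policy: str, idle_threads: int, n: int) -> bool:
    """Decide whether the HoL request, to be served with an (n, k) code, is admitted."""
    if policy == "blocking":
        return idle_threads >= n       # all n tasks must be able to start at once
    if policy == "non-blocking":
        return idle_threads >= 1       # one idle thread is enough
    raise ValueError(f"unknown policy: {policy}")


def tasks_started_immediately(idle_threads: int, n: int) -> int:
    """Number of the n injected tasks that start right away; the rest wait in the task queue."""
    return min(idle_threads, n)
```

With 2 idle threads and a (3, 2) code, a blocking policy keeps the request waiting while both threads sit idle, whereas a non-blocking policy admits it, starts 2 tasks, and leaves the third in the task queue.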

D. Multiple Classes of Requests 

In general, applications receive requests for both reading 
and writing for files of various sizes. From our measurement 
results (next section), it can be seen that the distributions 
of service times of tasks of different operation types and/or 
different chunk sizes differ significantly. Also, requests for 
different applications may have different delay targets (for 
example, video streaming has different delay requirements 
than uploading a document). As a result, it would be preferable 
to use different chunk sizes for different requests to accommo- 
date different delay requirements. It is then natural to group 
requests that have the same operation type, similar file sizes 
and similar delay requirements into one class and consider a 
composition of $m \ge 1$ classes of requests. Details of modeling
multiple classes of requests will be presented in Section III-B.
The following discussion of this paper will concentrate on 
queue management and adaptation of the amount of redundant 
read/write operations, based on the assumption that classes 
are given and the corresponding file/chunk sizes are prede- 
termined. Determining the choices of these parameters as 
functions of different delay requirements remains part of our 
future work.

E. Definition of Delays

Consider a time period $[0, T]$. We denote the set of requests
that arrive during this period by $I = \{1, 2, 3, \ldots, r, \ldots, N_T\}$,
where $r$ denotes the $r$-th arrived request and $N_T$ is the total
number of requests during the period. For each request $r$, denote $T_A^r$ as
the time when it arrives into the system. Given that request $r$
is served with an $(n, k)$ code, we index the corresponding $n$
tasks from 1 to $n$ according to the time they start being served,
and denote $T_S^{r,1} \le T_S^{r,2} \le \cdots \le T_S^{r,n}$ as their starting times.
Also denote $T_F^{r,j}$ as the completion time of task $j$ of request
$r$. Note that the tasks are ordered only by their starting times,
not by their completion times, so it is possible that $T_F^{r,j} > T_F^{r,l}$
even if $j < l$. The starting time of request $r$, denoted
as $T_S^r$, is defined as the time it gets admitted into the task
queue, i.e., the starting time of its first task, so $T_S^r = T_S^{r,1}$.
Its finish/completion time, denoted as $T_F^r$, is the time
when $k$ of its tasks have finished. Let $T_F^{r,(1)} \le T_F^{r,(2)} \le \cdots \le T_F^{r,(n)}$
be the sorted permutation of the finish times of request
$r$'s tasks. Then $T_F^r = T_F^{r,(k)}$.

The queueing delay for request $r$ is the length of time that
it spends waiting in the request queue, denoted by $D_Q^r = T_S^r - T_A^r$.
The service delay for request $r$ is the time it spends in
the system getting served, denoted by $D_S^r = T_F^r - T_S^r$. We
also denote the task delay for task $j$ of request $r$ by
$D^{r,j} = T_F^{r,j} - T_S^{r,j}$ unless the task is cancelled. When the task is
cancelled (because $k$ other tasks for the same request have
completed), $D^{r,j} = T_F^r - T_S^{r,j}$.
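
The definitions above translate into a few lines of code. The sketch below is illustrative: the function name and argument layout are assumptions, and it assumes that all n tasks of the request have started before the request completes (a task removed from the task queue before starting has no start time and is not covered).

```python
def request_delays(arrival, task_starts, task_finishes, k):
    """Return (queueing delay, service delay, per-task delays) for one request.

    arrival       -- arrival time of the request
    task_starts   -- start times of the n tasks
    task_finishes -- finish times the n tasks would have if never cancelled
    k             -- number of task completions that complete the request
    """
    request_start = min(task_starts)                # start of the first task
    request_finish = sorted(task_finishes)[k - 1]   # k-th earliest task finish
    queueing_delay = request_start - arrival
    service_delay = request_finish - request_start
    # A task still running when the request completes is cancelled at that time.
    task_delays = [min(finish, request_finish) - start
                   for start, finish in zip(task_starts, task_finishes)]
    return queueing_delay, service_delay, task_delays
```

For a request arriving at time 0 whose three tasks start at times 2, 2, 3 and would finish at 10, 5, 6, a (3, 2) code completes the request at time 6, and the slow first task is charged only until its cancellation.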

III. Measurement Results and Delay Model 
A. Measurement Results 

To model the distributions of service times ($D^{r,j}$) of
individual tasks, we ran measurements over Amazon EC2 and S3.
An EC2 instance served as our proxy node in the system model.
We instantiated an extra large EC2 instance with high I/O
capability in the same availability region as the S3 bucket that
stores our objects. We ran experiments within the Northern California
as well as Tokyo regions. We benchmarked single thread
vs. multiple thread environments to measure the impact of 
thread contention. For the machine type we used we were 
able to run 16 threads in parallel with almost linear gain 
in system throughput and observed almost identical delay 
distribution as single-thread. This means that for up to 16 
parallel threads the bottleneck is neither in the capacity of the 
EC2 instance nor in the network. We conducted experiments 
on different week days in March, April, June, and July 2012
with various packet sizes (128 bytes, 1 KB, 0.5 MB, 1 MB, 2 MB,
and 3 MB) using 16 threads in parallel while saturating each
thread. Each experiment lasted around 24 hours. We alternated 
between different packet sizes to capture similar time of day 
characteristics across packet sizes. For the same reasons, we 
also alternated between write and read jobs by first creating 
a batch of write jobs using distinct keys, then creating a 
batch of read jobs for these distinct keys once all the writes 
are completed successfully. Due to lack of space, we only
show a limited set of results, although the cross-correlation
properties and cumulative distribution functions exhibit similar
properties in the omitted cases. We briefly present a representative
subset of our main findings.

Fig. 2. CCDF of $D^{r,j}$ for read & write tasks for 1MB chunk.

Fig. 3. Correlation between time ordered delay samples at different lags.

Fig. 2 plots the complementary cumulative distribution
function (CCDF) of $D^{r,j}$ for read and write tasks of 1MB chunks.
Note that we only measure the time spent in any thread and
there are no queuing delays. Read tasks for small to medium
object sizes experience lower mean and median delays than the
write tasks, yet at higher percentile delays (in this plot beyond
the 80th percentile) reads observe higher delays. Although not
shown, as the object size gets smaller the crossover point moves
towards higher delay percentiles.

Fig. 3 shows the autocorrelation coefficient between the
service times of subsequent read tasks. Note that the mean
delay is subtracted from the delay samples and normalization
is done with respect to the 0-lag. Except for the 0-lag, there is
negligible correlation between subsequent delays. This
observation is critical, as FEC techniques would be too costly
and of little benefit if there were a strong correlation. The
observation holds for all the packet sizes we experimented
with as well as for the write tasks. Based on these results, for
further analysis, we will treat task service times as independent
and identically distributed (i.i.d.). In actual evaluations of the
proposed solutions, though, we will use the traces we collected
during our experiments.
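
The normalization described for the autocorrelation plot (subtract the mean, divide by the 0-lag value) can be sketched as follows. This is a plain-Python illustration with a function name chosen for this example; it is not the authors' measurement script.

```python
def autocorrelation(samples, lag):
    """Autocorrelation coefficient of time-ordered samples at a given lag.

    The sample mean is subtracted first, and the result is normalized by the
    lag-0 value, so autocorrelation(samples, 0) is 1. Requires 0 <= lag < len(samples)
    and samples that are not all identical.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    lag0 = sum(c * c for c in centered)
    lagged = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return lagged / lag0
```

Values close to zero at every non-zero lag are what justify treating task service times as i.i.d. in the analysis.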

To show the impact of using different codes on the service
times (i.e., $D_S^r$ as opposed to $D^{r,j}$), we plot the case for
2MB files with codes ranging from (1, 1) to (7, 4) in Fig. 4.
Codes (1, 1), (2, 2), and (4, 4) do not employ FEC, but instead use
different chunk sizes. The (2, 1) code provides 23%, 32%, and 56%
reductions in mean, 90th percentile, and 99th percentile delays
over (1, 1) using 2x storage. Using smaller chunk sizes
with FEC improves delays at the same or less storage cost.
E.g., the (3, 2) code provides 50%, 55%, and 69% reductions in
mean, 90th percentile, and 99th percentile delays over (1, 1)
using 1.5x storage, the (5, 4) code gives more than 60% reductions
in the same percentiles using only 1.25x storage, and the (7, 4)
code improves delays by 76%, 80%, and 85% at the expense
of 1.75x storage. Using smaller chunk sizes without FEC
improves mean delay performance, but at higher percentiles
the benefits deteriorate. This is expected, as uncoded chunking
requires completion of all tasks and small chunk sizes also
have a long tail. The chance of catching the tail increases as
the number of chunks increases. FEC greatly mitigates this
all-or-nothing behavior. The gains in service delay $D_S^r$ are only
half of the story, as chunking and FEC both adversely affect
the achievable rate region, as examined in later sections.

Fig. 4. CCDF of service times for reading 2MB file using different
chunk/object sizes and FEC rates.

Fig. 5. Multi-phase queueing model. A blue server indicates a request is
being served at the corresponding phase. A grey server indicates that there is
a request in the same pipe and being served at a later phase. A white server
indicates it has not served the request in the same pipe or there is no request
in the pipe. The numbers at the bottom of each phase are the number of busy
servers and the service rate of each server of that phase.
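
Two of the quantities discussed in this subsection are easy to reproduce: the storage overhead n/k of an (n, k) code, and the "catching the tail" effect. The sketch below is an illustration under assumptions that go beyond the measurements: all n tasks start together, and each task independently misses a given deadline with the same probability q. The function names are chosen for this example.

```python
from fractions import Fraction
from math import comb


def storage_overhead(n: int, k: int) -> Fraction:
    """Storage used by an (n, k) code relative to storing the file once."""
    return Fraction(n, k)


def prob_request_exceeds(n: int, k: int, q: float) -> float:
    """Probability that an (n, k) request is unfinished at a deadline, given that
    each task independently misses that deadline with probability q.

    The request needs k finished tasks, so it misses the deadline exactly
    when at least n - k + 1 of its n tasks miss it.
    """
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i)
               for i in range(n - k + 1, n + 1))
```

With n = k (uncoded chunking) the expression reduces to 1 - (1 - q)^k, which grows with the number of chunks; adding redundant tasks requires several simultaneous misses and shrinks this probability sharply. Note that q itself depends on the chunk size, so this comparison is only qualitative.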

B. Model of Task Delays 

From Fig. 2 it can be observed that for both read and write
tasks, despite the delay floors observed at very low percentiles,
up to the 99th percentile and even beyond that, the CCDF (in log
scale) is roughly a constant term plus a linear term in delay.
Together with the observation from Fig. 3 that task delays are
weakly correlated over time, for tasks of the same operation
type and same chunk size, we decide to model the task delays
as i.i.d. random variables of the form $\Delta + D_{\exp}$, where $\Delta$
is a non-negative constant (corresponding to the constant term
in the CCDF), and $D_{\exp}$ is an exponentially distributed random
variable with some mean $1/\mu$ (corresponding to the linear term
in the CCDF). For mean delay analysis, our simulations later will
show that this approximation works reasonably well.
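
A shifted-exponential delay of this form has simple closed-form statistics, sketched below. The function names are assumptions made for this example, and the parameter values in the usage note are hypothetical, not measured values from the paper.

```python
import math


def task_delay_ccdf(x: float, delta: float, mu: float) -> float:
    """P(task delay > x) when the delay is delta plus an exponential with mean 1/mu."""
    if x <= delta:
        return 1.0
    return math.exp(-mu * (x - delta))


def task_delay_mean(delta: float, mu: float) -> float:
    """Mean task delay: the constant floor plus the exponential mean."""
    return delta + 1.0 / mu


def task_delay_quantile(p: float, delta: float, mu: float) -> float:
    """Delay below which a fraction p (0 <= p < 1) of tasks complete."""
    return delta - math.log(1.0 - p) / mu
```

On a log scale the CCDF is `-mu * (x - delta)` for `x > delta`, i.e., linear in the delay with slope `-mu`, which matches the shape described above. For instance, with hypothetical values `delta = 0.1` and `mu = 2.0` the mean delay is 0.6.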

We assume there are $m \ge 1$ classes of requests. Requests
of each class have identical file size and all are divided into
chunks of identical size. Under this assumption, service times
of all chunks of the same class follow the same distribution and
each class $i$ can be characterized by a three-tuple $(k_i, \Delta_i, \mu_i)$,
where $\Delta_i$ and $\mu_i$ specify the delay distribution of class-$i$
chunks. Throughout this paper, we assume the $k_i$'s (and
accordingly chunk sizes) are determined a priori and $(\Delta_i, \mu_i)$ are
given. Our focus will be on the adaptation/choice of the $n_i$'s.

IV. Single-Class (Homogeneous) Arrivals 

In this section, we study the scenario when there is only one
class of requests, i.e., $m = 1$. Since there is only one class, we
will drop the subscript $i$ within this section.

We first investigate the delay and throughput tradeoff with
fixed FEC, i.e., a fixed (n, k) code is used for all requests,
for both blocking and non-blocking schemes. Due to the
interdependent nature of task delays while employing FEC, the
queueing model for these policies is much more complicated
than an M/G/k queue, which itself has only crude approximations
for delays. We are not able to provide exact analysis at this
time. However, we develop reasonable approximations for both 
capacity and delay of these policies. Based on these approx- 
imation results, we develop a backlog-based adaptive FEC 
scheduler BAFEC, which achieves the best delay performance 
against fixed FEC schemes for all supportable arrival rates. 

A. Queueing Model for Blocking Policies with Fixed FEC 

Given our assumption that the task delay is of the form
$\Delta + D_{\exp}$, we can consider that, once a thread starts serving it,
a task experiences two phases of service: first a
fixed-time service of length $\Delta$, followed by an exponential-time
service with mean $1/\mu$. Recall that in blocking policies
all tasks of a request $r$ start at the same time, i.e.,
$T_S^r = T_S^{r,1} = \cdots = T_S^{r,n}$. Then the service received by each request
can be modeled in $k + 1$ phases. The first is a fixed-delay
phase of length $\Delta$, during which all $n$ tasks are in their fixed-time
service phase. The second is an exponential phase with mean
$1/(n\mu)$, during which all $n$ tasks receive exponential-time service
and one task finishes by the end. Similarly, the third is an
exponential phase with mean $1/((n-1)\mu)$, during which the remaining
$n - 1$ tasks receive exponential-time service and one more
task finishes by the end, and so on. The $(k+1)$-th is an exponential
phase with mean $1/((n-k+1)\mu)$, during which the last $n - k + 1$ tasks
receive exponential-time service and the $k$-th task finishes
by the end (hence the whole request finishes and the remaining
tasks are cancelled). We say a request is in phase $\Delta$ or
phase $j$ ($n - k + 1 \le j \le n$) depending on the number
of its remaining tasks and the phase these tasks are in. Now
we can model a blocking policy with the queueing system
depicted in Fig. 5. There are $\lfloor L/(n-k+1) \rfloor$ pipes of servers,
where $\lfloor L/(n-k+1) \rfloor$ is the maximum number of requests that
can be served simultaneously.
Each pipe consists of $k + 1$ servers that correspond to the $k + 1$
phases of service delay a request experiences: the first server
has fixed service time $\Delta$, the second has exponential service
time with mean $1/(n\mu)$, and so on, up to the $(k+1)$-th, which has
exponential service time with mean $1/((n-k+1)\mu)$. Denote $S_\Delta(t)$ and $S_j(t)$
($n - k + 1 \le j \le n$) as the number of requests being served in
the corresponding phases at time $t$. Noticing that every request
in phase $\Delta$ corresponds to $n$ active threads and every request
in phase $j$ corresponds to $j$ active threads, the number of active
threads at time $t$ is $n S_\Delta(t) + \sum_{j=n-k+1}^{n} j S_j(t)$. A request waiting
in the queue is admitted into a pipe only if that pipe is not
hosting any other request and there are at least $n$ idle threads,
i.e.,

$$ n S_\Delta(t) + \sum_{j=n-k+1}^{n} j S_j(t) \le L - n. $$
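
The phase structure and the thread-count condition can be sketched as below. The function names are assumptions made for this example, and the admission check covers only the thread-count part of the rule (it takes for granted that a free pipe exists).

```python
def phase_means(n: int, k: int, delta: float, mu: float) -> list[float]:
    """Mean durations of the k + 1 service phases of one request under a blocking policy."""
    means = [delta]                       # fixed-delay phase shared by all n tasks
    for j in range(n, n - k, -1):         # j = n, n-1, ..., n-k+1 tasks still running
        means.append(1.0 / (j * mu))      # minimum of j exponentials with rate mu
    return means


def active_threads(s_delta: int, s_phase: dict[int, int], n: int) -> int:
    """Threads in use: n per request in the fixed phase, j per request in phase j."""
    return n * s_delta + sum(j * count for j, count in s_phase.items())


def has_n_idle_threads(s_delta: int, s_phase: dict[int, int], n: int, L: int) -> bool:
    """Thread-count part of the blocking admission rule: at least n idle threads."""
    return active_threads(s_delta, s_phase, n) <= L - n
```

For a (3, 2) code with one request in the fixed phase, one in phase 3 and two in phase 2, ten threads are busy, so a new request fits when L = 16 but not when L = 12.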

B. Capacity of Blocking Policies 

Let $\bar{S}_\Delta$ and $\bar{S}_j$ denote the time averages of $S_\Delta(t)$ and $S_j(t)$.
Assuming the queueing system is stable at arrival rate
$\lambda$, and noticing that the arrival rate to each phase then equals $\lambda$,
we have the following flow-balance
equations from Little's law:

$$ \bar{S}_\Delta = \lambda \Delta, \qquad \bar{S}_j = \frac{\lambda}{j\mu}, \quad \forall\, n - k + 1 \le j \le n. $$



As a result, the expected number of simultaneously active
threads at arrival rate $\lambda$ is

$$ n \bar{S}_\Delta + \sum_{j=n-k+1}^{n} j \bar{S}_j = \lambda (n\Delta + k/\mu). $$



Since there are at most $L$ parallel active threads allowed, we
have the following constraint on supportable arrival rates:

$$ \lambda (n\Delta + k/\mu) \le L. \qquad (1) $$



For the study of capacity, it suffices to consider the
case when the system is always backlogged. When the system is
always backlogged, whenever there are at least $n$ idle threads, the
HoL request will be admitted into one pipe. So the number of
active threads is kept $\ge L - n + 1$. Then we have the following
upper and lower bounds on $C_b(n, k)$, the capacity of blocking
policies using a fixed $(n, k)$ code:

$$ \frac{L - n + 1}{n\Delta + k/\mu} \le C_b(n, k) \le \frac{L}{n\Delta + k/\mu}. \qquad (2) $$



While a more accurate approximation is possible, we use the
mean of the two bounds as our estimate for $C_b$:

$$ C_b(n, k) \approx \frac{L - (n - 1)/2}{n\Delta + k/\mu}. $$
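
The bounds and the midpoint estimate are straightforward to evaluate numerically. The sketch below uses names chosen for this example, with `delta` and `mu` standing for the constant and exponential-rate parameters of the task delay model; the parameter values in the usage note are hypothetical.

```python
def thread_time_per_request(n: int, k: int, delta: float, mu: float) -> float:
    """Denominator shared by the bounds: n * delta + k / mu."""
    return n * delta + k / mu


def capacity_bounds(L: int, n: int, k: int, delta: float, mu: float):
    """Return (lower bound, upper bound, midpoint estimate) of the blocking capacity."""
    u = thread_time_per_request(n, k, delta, mu)
    lower = (L - n + 1) / u
    upper = L / u
    estimate = (L - (n - 1) / 2) / u      # mean of the two bounds
    return lower, upper, estimate
```

For example, with L = 16 threads, a (3, 2) code, and hypothetical values delta = 0.1 and mu = 2.0, the denominator is 1.3 and the estimate is 15 / 1.3 requests per unit time, compared with 15.5 / 1.2 for the uncoded (2, 2) case at the same chunk size.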

From the above discussion, we can see that the capacity with
fixed FEC is roughly proportional to the inverse of

$$ u(n) = n\Delta + k/\mu. $$

In fact, from our delay model, one can easily verify that
$E\left[\sum_{j=1}^{n} D^{r,j}\right] = n\Delta + \sum_{j=n-k+1}^{n} j \cdot \frac{1}{j\mu} = u(n)$ (note that the
slowest $n - k$ threads are cancelled by the time $k$ threads finish).
In other words, $u(n)$ is the expected sum of the amount of
time used by $n$ threads in serving one request. For this reason,
we call $u(n)$ the expected per-request system usage for using an
$(n, k)$ FEC code, or usage for short. The first term is linear
in $n$ and represents the constant per-thread cost $\Delta$ we pay for
having more parallelism. As we can see from Eq. (2) (especially the
upper bound), if $\Delta$ is large compared with $1/\mu$, the capacity
is significantly reduced when a low-rate FEC code (large $n$)
is used, and the queueing delay will quickly explode even at
arrival rates that are low relative to the capacity with no coding,
$C_b(k, k)$. We are going to investigate the delay issue in more
detail in the rest of this section.

C. Delays of Blocking Policies 

According to our model for task delay, the expected service
delay of a blocking policy is $D_S(n, k) = \Delta + \sum_{j=n-k+1}^{n} \frac{1}{j\mu}$.
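
The expected service delay is the fixed term plus the means of the k exponential phases, which the following sketch evaluates. The function name and the numeric values in the usage note are assumptions for illustration only.

```python
def expected_service_delay(n: int, k: int, delta: float, mu: float) -> float:
    """Fixed delay plus the sum of 1 / (j * mu) for j = n-k+1, ..., n."""
    return delta + sum(1.0 / (j * mu) for j in range(n - k + 1, n + 1))
```

With hypothetical values delta = 0.1 and mu = 2.0, a (1, 1) request takes 0.6 on average while a (2, 1) request takes 0.35, since only the faster of the two tasks has to finish. Comparisons across different k are less direct, because delta and mu depend on the chunk size and would have to be measured for each k.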

For queueing delay, we approximate the request queue and
dispatcher by a virtual single-server queue. The virtual server's
service time for request $r$ is determined by $T_S^{r+1} - T_S^r$, i.e., the
inter-starting time of the requests $r$ and $r + 1$ in our original
system. So from the request queue's point of view, the virtual
server behaves exactly as the dispatcher, and the virtual queue
has the same queueing delay as our original system.

In general, the service times of different requests in the
virtual system are not necessarily independent. In fact, the
service time also depends on the arrival process, so an exact
analysis of the queueing delay is very complicated. However,
we notice that in the low-utilization regime the total delay is
dominated by the service delay, which we know exactly. On
the other hand, in the high-utilization regime where queueing
delay dominates, the system is mostly backlogged and the
inter-starting times are weakly correlated and independent of the
arrival process. For this reason, we use an M/G/1 queue
approximation, wherein the service time follows an Erlang
distribution with parameters ?? and mean $1/C_b$.

To understand the choice of the Erlang distribution, consider the
case when $\Delta = 0$ and $n = k$. Suppose the system is backlogged
and all $L$ threads are busy immediately after $T_s^{r}$. Then
$T_s^{r+1}$ is the time when the earliest $n$ out of $L$ threads
become idle. Since $\Delta = 0$, all task delays are exponential.
The inter-starting time is then the sum of $n$ exponential random
variables whose means are $1/(L\mu), \cdots, 1/((L-n+1)\mu)$. This
is very similar to an Erlang distribution with parameter $n$ and
mean $\sum_{j=0}^{n-1} \frac{1}{(L-j)\mu}$, which is the sum of $n$
i.i.d. exponential random variables with mean
$\frac{1}{n}\sum_{j=0}^{n-1} \frac{1}{(L-j)\mu}$. When $L/n$ is
sufficiently large, the inter-starting time distribution converges
to the Erlang distribution. When $\Delta > 0$ and $n > k$, this
approximation can be quite crude, but we believe it is good
enough as a guideline for the purpose of policy design.
Moreover, it also provides a simple closed-form approximation
of the queueing delay, which is used in the design of our adaptive
FEC scheduler. Given an Erlang random variable $X$ with
parameter $n$ and mean $1/C_b$, its second moment is
$\mathbb{E}[X^2] = (1 + 1/n)/C_b^2$. The queueing delay of the
aforementioned M/G/1 queue (using the Pollaczek-Khinchin formula)
is then

$$D_q^{b}(n,k,\lambda) = \frac{\lambda\,\mathbb{E}[X^2]}{2(1-\lambda\,\mathbb{E}[X])} = \frac{\lambda(n+1)}{2n\,C_b(n,k)\,(C_b(n,k) - \lambda)}.$$
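The service and queueing delay approximations for blocking policies can be combined into a small calculator. This is a minimal sketch under the model stated above (task delay equal to a constant plus an exponential term, Erlang service time with shape $n$); the function names are illustrative, not the authors' code.

```python
def service_delay(n, k, delta, mu):
    """D_s(n, k) = delta + sum_{j=n-k+1}^{n} 1/(j*mu)."""
    return delta + sum(1.0 / (j * mu) for j in range(n - k + 1, n + 1))


def pk_queueing_delay(lam, capacity, shape):
    """Pollaczek-Khinchin mean wait for Erlang(shape) service with mean 1/capacity.

    Uses E[X] = 1/C and E[X^2] = (1 + 1/shape)/C^2, which simplifies to
    lam*(shape + 1) / (2*shape*C*(C - lam)). Returns inf when lam >= C.
    """
    if lam >= capacity:
        return float("inf")
    return lam * (shape + 1) / (2 * shape * capacity * (capacity - lam))


def blocking_total_delay(lam, n, k, L, delta, mu):
    """Estimated total delay D_s + D_q^b for a blocking policy with an (n, k) code."""
    c_b = (L - (n - 1) / 2) / (n * delta + k / mu)  # mean of the bounds in Eq. (2)
    return service_delay(n, k, delta, mu) + pk_queueing_delay(lam, c_b, n)
```

For example, `blocking_total_delay(10.0, 4, 3, 16, 0.060, 12.5)` evaluates the estimate at 10 requests per second for a (4, 3) code with hypothetical delay parameters.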



D. Approximations for Non-Blocking Policies with Fixed FEC 

The only difference between blocking and non-blocking
policies is that a non-blocking policy starts a task whenever a
thread becomes available, while a blocking policy waits until
$n$ threads become available. This difference is subtle, yet
it makes non-blocking policies much harder to analyze exactly
than blocking policies. In this section, we derive approximations
of the capacity and delays of non-blocking policies.

TABLE I
Range of errors: $|D_{sim} - D|/D \times 100\%$

| Policy | $\Delta/(\Delta + 1/\mu)$ | L = 16, n = 3 | L = 16, n = 6 | L = 64, n = 3 | L = 64, n = 6 |
|---|---|---|---|---|---|
| Blocking | 0.2 | 1.0 - 11.4 | 2.0 - 20.6 | 0.3 - 6.1 | 0.5 - 8.6 |
| Blocking | 0.4 | 1.0 - 13.4 | 0.8 - 7.9 | 0.3 - 7.8 | 0.5 - 9.5 |
| Blocking | 0.6 | 1.2 - 15.5 | 1.9 - 63.2 | 0.3 - 9.6 | 0.6 - 11.7 |
| Blocking | 0.8 | 1.3 - 18.0 | 1.0 - 339.3 | 0.4 - 11.6 | 0.6 - 10.5 |
| Non-blocking | 0.2 | 0.9 - 11.1 | 1.8 - 8.5 | 0.3 - 7.4 | 0.5 - 8.4 |
| Non-blocking | 0.4 | 1.0 - 13.1 | 1.9 - 11.0 | 0.3 - 8.2 | 0.5 - 9.5 |
| Non-blocking | 0.6 | 1.1 - 15.0 | 2.0 - 16.8 | 0.3 - 9.0 | 0.5 - 10.4 |
| Non-blocking | 0.8 | 1.2 - 16.4 | 2.1 - 29.2 | 0.3 - 10.0 | 0.6 - 11.0 |

Notice that, when there are $L$ busy threads, the rate at
which some thread becomes available is on the order of
$O(L/(\Delta + 1/\mu))$, which is much higher than the rate at
which one particular busy thread becomes idle when $L$ is large.
As a result, it is highly likely that, in a non-blocking policy,
all tasks of a request get started before any one of them
finishes, and the gap between the first and last starting times
of the tasks is much smaller than the individual task delay.
Hence, for large $L$, a non-blocking policy behaves very similarly
to a blocking policy that uses the same FEC code, and the
capacity of a non-blocking policy can be approximated by the
capacity of a blocking one. Further notice that, when always
backlogged, non-blocking policies always keep all $L$ threads
busy. So we approximate the capacity of a non-blocking policy
with the upper bound for blocking:

$$C_{nb}(n,k) \approx \frac{L}{n\Delta + k/\mu}. \tag{3}$$

Then we again use the Pollaczek-Khinchin formula to estimate the
queueing delay of the non-blocking policy, $D_q^{nb}$, by replacing
$C_b$ with $C_{nb}$ in the previous formulation for $D_q^{b}$, and
use $D_s = \Delta + \sum_{j=n-k+1}^{n} \frac{1}{j\mu}$ as an
approximation of the service delay.

By doing this, $\mathbb{E}[X] = u(n)/L$ for non-blocking.
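The non-blocking approximation lends itself to a quick sweep over arrival rates to see which fixed code length gives the smallest estimated delay. The sketch below takes the capacity as $L/u(n)$ and an Erlang service time with shape $n$, as described above; the helper names and the parameter values are hypothetical.

```python
def nonblocking_delay(lam, n, k, L, delta, mu):
    """Estimated total delay D_s + D_q^nb for a fixed (n, k) code, in seconds."""
    c_nb = L / (n * delta + k / mu)          # upper bound of Eq. (2): L / u(n)
    if lam >= c_nb:
        return float("inf")                  # beyond the estimated capacity
    d_s = delta + sum(1.0 / (j * mu) for j in range(n - k + 1, n + 1))
    d_q = lam * (n + 1) / (2 * n * c_nb * (c_nb - lam))
    return d_s + d_q


def best_fixed_n(lam, k, n_max, L, delta, mu):
    """Code length in {k, ..., n_max} with the smallest estimated delay."""
    return min(range(k, n_max + 1),
               key=lambda n: nonblocking_delay(lam, n, k, L, delta, mu))


if __name__ == "__main__":
    # Hypothetical parameters: delta = 60 ms, 1/mu = 80 ms, L = 16, k = 3.
    for lam in (1, 10, 20, 25, 30, 35):
        print(lam, best_fixed_n(lam, 3, 6, 16, 0.060, 12.5))
```

With these assumed values the preferred code length falls from the maximum at light load to no coding near capacity, which is the behavior an adaptive scheme tries to track.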

We compare the approximated delays of blocking and non-blocking
policies, $D^{b} = D_s + D_q^{b}$ and $D^{nb} = D_s + D_q^{nb}$,
against the average delays from simulations using task delays
of the form $\Delta + \mathrm{Exp}(\mu)$ (denoted as $D_{sim}^{b}$
and $D_{sim}^{nb}$). Table I shows the range of estimation errors
for $k = 3$, $n = 3, 6$ and $L = 16, 64$, while the arrival rate
varies from $0.1C_x$ to $0.9C_x$ ($x = b$ or $nb$). For each
setting, the lower end of the estimation error is observed at low
to medium arrival rates, while larger errors are observed for
arrival rates near the estimated capacity. This is mainly due to
the high sensitivity of $D$ to $C$ when $\lambda \to C$ (because of
the $C - \lambda$ term in the denominator), so even a small
discrepancy between $C$ and the actual capacity is significantly
magnified in the delay at arrival rates close to capacity.
As we can see, the approximations are quite reasonable, except
for the cases when $L = 16$, $n = 6$ and
$\Delta/(\Delta + 1/\mu) = 0.6, 0.8$. This is because the Erlang
distribution is not a good approximation for the inter-starting
time when $\Delta$ and $n$ are large compared to $1/\mu$ and $L$,
respectively. We also observe that the approximation for the
non-blocking policy is generally better than the one for the
blocking policy. This is because the Erlang distribution is a
much better approximation for the inter-starting time of
non-blocking schemes, since the number of busy threads remains
fixed (equal to $L$) when the system is backlogged.

Fig. 6. Estimation vs. trace-driven simulation

We further compare the approximation against trace-driven
simulations. Fig. 6 plots $D^{nb}$ and the average delay from
simulations for reading 3MB files with fixed FEC schemes
with $k = 3$, $n = 3, 4, 5, 6$ and $L = 16$, using traces for
read operations we collected in June and chunk sizes of 1MB.
For the computation of $C_{nb}$, we first filter out the worst 0.1%
of task delays in the trace, then we set $1/\mu$ and
$\Delta + 1/\mu$ to the standard deviation and the mean of the
remaining task delays, respectively. We emphasize that although
we use the filtered task delays to obtain estimates of $\Delta$
and $1/\mu$, all unfiltered task delays are used in the
simulations. As we can see, our approximation matches the
simulation results very well, which justifies our
$\Delta + \mathrm{Exp}(\mu)$ model for task delays.
The simulation results also suggest that the capacity of
non-blocking policies with fixed FEC is a decreasing function of
$n$, which is consistent with our approximation of $C_{nb}$ from
Eq. (3). We also plot the delay for the simple no-chunking
solution (using a (1, 1) code), as well as the simple 2× replication
solution (using a (2, 1) code), using traces for chunk size 3MB
collected in the same time period. Despite providing a larger
capacity, the simple no-chunking solution has very bad delay
performance. Even for very low arrival rates, the delay is over
300 ms, while just chunking without FEC ($n = 3$) improves
the delay to about 200 ms with zero storage overhead, and
using a (4, 3) code with 1/3 storage overhead improves the
delay to less than 150 ms. Moreover, it is a bit surprising
that simply replicating unchunked objects not only fails to
improve delay but also significantly reduces capacity. This
is because read/write operations for large objects have a small
delay spread (i.e., $\Delta$ much larger than $1/\mu$) according to
our measurement results. This again justifies our motivation for
using chunking with FEC for delay-sensitive applications. With
the same amount of storage overhead as 2× replication, using a
(6, 3) code delivers roughly a 3× improvement in delay.

E. BAFEC- Backlog-based Adaptive FEC Scheduler 

In this section, we present BAFEC, a backlog-based adaptive
FEC scheme that achieves the best delay achievable by
any fixed FEC scheme with $k \leq n \leq n_{max}$, i.e.,

$$\min_{k \leq n \leq n_{max}} \left[D_s(n,k) + D_q(n,k,\lambda)\right],$$

for all supportable arrival rates. The following discussion
applies to both blocking and non-blocking policies, so we drop
the superscript in the delay terms.

Fig. 7. Average delay, fixed FEC vs. greedy vs. BAFEC

Assuming that $k$ is fixed, our estimate of the expected total
delay is a function of $n$ and $\lambda$:
$D(n,\lambda) = D_s(n) + D_q(n,\lambda)$.
For every $n = k, \cdots, n_{max} - 1$, we compute the solution
$\lambda_n$ such that

$$D(n,\lambda_n) = D(n+1,\lambda_n). \tag{4}$$

According to our previous analysis, this only requires solving
a quadratic equation in $\lambda$, and only the smaller solution is
meaningful. Due to space limitations, we do not include
the details. $\lambda_n$ is the crossover point for the delay
performance of an $(n,k)$ code and an $(n+1,k)$ code: if
$\lambda < \lambda_n$, then an $(n+1,k)$ code gives a smaller total
delay than an $(n,k)$ code does; and if $\lambda > \lambda_n$, an
$(n,k)$ code gives a smaller total delay. Using Little's law, we
compute the corresponding crossover backlog size
$Q_n = \lambda_n D_q(n,\lambda_n)$. It is easy to show that $Q_n$
is a decreasing function of $n$, so we can use the $\{Q_n\}$'s
as thresholds to adapt the FEC code based on the backlog
size. The adaptive scheme is described formally as follows:

BAFEC (Backlog-based Adaptive FEC)
Do the following for every request $r$:

1: $Q \leftarrow$ backlog size upon arrival of request $r$.

2: Find $n$ such that $Q \in [Q_n, Q_{n-1})$, or $Q \in [Q_n, \infty)$ for
   $n = k$, or $Q \in [0, Q_n)$ for $n = n_{max}$.

3: Serve request $r$ with an $(n,k)$ code when it becomes HoL.
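The threshold computation and the selection step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it uses the non-blocking capacity estimate $L/u(n)$, finds each crossover rate $\lambda_n$ by bisection rather than by the closed-form quadratic solution (which the text omits), and assumes the indexing in which $Q_n$, for $n = k, \cdots, n_{max}-1$, separates code lengths $n$ and $n+1$. All function names and parameter values are hypothetical.

```python
def service_delay(n, k, delta, mu):
    """D_s(n) = delta + sum_{j=n-k+1}^{n} 1/(j*mu)."""
    return delta + sum(1.0 / (j * mu) for j in range(n - k + 1, n + 1))


def queueing_delay(lam, n, k, L, delta, mu):
    """D_q(n, lam) with capacity L/u(n) and Erlang(n) service time."""
    c = L / (n * delta + k / mu)
    if lam >= c:
        return float("inf")
    return lam * (n + 1) / (2 * n * c * (c - lam))


def crossover_rate(n, k, L, delta, mu, iters=200):
    """Bisection for lambda_n with D(n, lam) = D(n+1, lam) on (0, C(n+1))."""
    def gap(lam):  # D(n+1, lam) - D(n, lam): negative while n+1 is better
        return (service_delay(n + 1, k, delta, mu)
                + queueing_delay(lam, n + 1, k, L, delta, mu)
                - service_delay(n, k, delta, mu)
                - queueing_delay(lam, n, k, L, delta, mu))
    lo, hi = 0.0, L / ((n + 1) * delta + k / mu)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def bafec_thresholds(k, n_max, L, delta, mu):
    """Map n -> Q_n = lambda_n * D_q(n, lambda_n) for n = k, ..., n_max - 1."""
    out = {}
    for n in range(k, n_max):
        lam_n = crossover_rate(n, k, L, delta, mu)
        out[n] = lam_n * queueing_delay(lam_n, n, k, L, delta, mu)
    return out


def bafec_pick(backlog, thresholds, k, n_max):
    """Smallest n whose threshold the backlog reaches; n_max if it reaches none."""
    for n in range(k, n_max):
        if backlog >= thresholds[n]:
            return n
    return n_max


if __name__ == "__main__":
    # Hypothetical parameters: delta = 60 ms, 1/mu = 80 ms, L = 16, k = 3.
    q = bafec_thresholds(3, 6, 16, 0.060, 12.5)
    print({n: round(v, 2) for n, v in q.items()})
    print([bafec_pick(b, q, 3, 6) for b in (0.0, 0.6, 1.0, 5.0)])
```

Because the thresholds depend only on the delay statistics and on $L$, they can be computed off-line, leaving a table lookup per request at run time.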

F. Performance Evaluation 

We conduct trace-driven simulations for performance evaluation.
Due to lack of space, we only show results for non-blocking
versions, with $k = 3$, $n_{max} = 6$ and $L = 16$, using
traces we collected in March and a chunk size of 1MB. Results
for other parameter settings and for blocking versions are
similar. We also develop a simple Greedy heuristic scheme.
Unlike BAFEC, Greedy does not require any knowledge of the
distribution of task delays, yet it achieves competitive mean
delay performance. In Greedy, the code used to serve request
$r$ is determined by the number of idle threads upon its arrival:
if there are $l \geq k$ idle threads, use a $(\min(l, n_{max}), k)$
code; otherwise use a $(k,k)$ code, i.e., no coding.

Fig. 8. Greedy vs. BAFEC, normalized delays
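The Greedy rule is simple enough to state in a few lines. The function name below is hypothetical; the logic follows the rule as described in the text.

```python
def greedy_code_length(idle_threads, k, n_max):
    """Number of tasks n of the (n, k) code chosen by the Greedy rule."""
    if idle_threads >= k:
        # Use as many idle threads as possible, capped at n_max.
        return min(idle_threads, n_max)
    # Fewer than k idle threads: fall back to a (k, k) code, i.e., no coding.
    return k
```

Note that the rule needs only the current number of idle threads, which is why it requires no knowledge of the task delay distribution.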

Fig. 7 plots the average delays of fixed FEC schemes with
$n = 3, 4, 5, 6$, as well as the delays of Greedy and BAFEC.
As we can see, Greedy and BAFEC have almost identical
performance in terms of average delay. Both adaptive schemes
succeed in (roughly) achieving our goal of obtaining the
lower envelope of the delay performance of the set of fixed
FEC schemes. At lower utilization, they deliver over 3×
lower delay compared to no chunking and simple 2× replication
($n = 1, 2$; $k = 1$), and 2× lower delay compared to naive
chunking ($n = k = 3$).

We plot the average, 99% and 99.9% delays of Greedy and
BAFEC, normalized by the best delays obtained from fixed
FEC schemes, in Fig. 8. At very low and very high arrival rates,
these two adaptive schemes perform almost the same as the
optimal fixed FEC scheme. This is because (1) with low arrival
rates, there is no backlog most of the time and both schemes
behave like a fixed FEC scheme with $n = n_{max}$; and (2) with
high arrival rates, the system is always backlogged and both
schemes behave like a fixed FEC scheme with $n = k$. In the
intermediate region, BAFEC still tracks the best performance
of fixed FEC schemes very well: it is almost identical to
the best fixed FEC scheme in mean delay, and it stays within
1.25× and 1.5× of the optimal for 99% and 99.9% delays,
respectively. On the other hand, while Greedy also achieves
almost optimal mean delay performance, it performs much
worse for high-percentile delays. For low to medium arrival
rates, Greedy is about 1.5× of the optimal in 99% delay, while
BAFEC stays below 1.1×. For 99.9% delay, the advantage of
BAFEC is even more significant, as Greedy is consistently
above 2× and even goes beyond 3.5× of the optimal.

V. Multiple-Class (Heterogeneous) Arrivals 

In this section, we study the scenario with multiple classes of
requests ($m > 1$). As the multi-class problem is even more
complicated than the single-class one, we again base our
analysis on approximations of the queueing and service delays.
Our analysis shows that the delay-optimal combination of code
lengths ($n_i$'s) has a well-defined structure that is helpful for
designing practical rate adaptation schemes:

• There is a one-to-one mapping between the optimal code
lengths and the corresponding expected total queue length
(all classes combined), irrespective of the arrival rates;

• The optimal code length of any class is a decreasing
function of the expected total queue length.

These results suggest that (1) the expected queue length is
a good indicator of the optimal code lengths and (2) adaptation
of each class can be done separately. Based on these insights,
we develop a Multi-class Backlog-based Adaptive FEC
(MBAFEC) scheme. In MBAFEC, each class $i$ is associated
with a set of thresholds computed using Eq. (4) as in BAFEC,
assuming the single-class scenario with only class-$i$ requests;
and code adaptation within each class is performed in the same
way as in BAFEC.

A. Fixed FEC Code Analysis 

We assume that arrivals of each class $i$ follow a Poisson
process with rate $\lambda_i \geq 0$, independent of other classes.
So the combined arrivals constitute a Poisson process with rate
$\lambda = \sum_{i=1}^{m} \lambda_i$. The following notation and
terminology will be used in the subsequent discussion.

• The (column) rate vector $\vec{\lambda} = [\lambda_1; \cdots; \lambda_m]$
and the composition vector
$\vec{\alpha} = [\alpha_1; \cdots; \alpha_m] = \vec{\lambda}/\lambda$.
Note that $0 \leq \alpha_i = \lambda_i/\lambda \leq 1$ and
$\sum_{i=1}^{m} \alpha_i = 1$.

• The code vector $N = [n_1; \cdots; n_m]$, given that $n_i$ is the
code length chosen for class-$i$ requests.

• The usage vector $U(N) = [u_1(n_1); \cdots; u_m(n_m)]$, where
$u_i(n_i) = n_i\Delta_i + k_i/\mu_i$ is the per-request usage of
class-$i$ requests. When it is clear from context, we will omit the
function inputs ($N$ and $n_i$).

We can easily generalize the multi-phase queueing model
introduced in the previous section to incorporate multiple
classes of requests. For every class, we construct a set of pipes
as in Fig. 5, with the number of servers in each pipe and their
service rates as specified by the delay parameters of the class.
A class-$i$ request is admitted into a pipe for class $i$ only if
there are at least $n_i$ idle threads. Similar to Eq. (1), for a
given code vector $N$, a supportable rate vector $\vec{\lambda}$
must satisfy

$$\sum_{i=1}^{m} \lambda_i\left(n_i\Delta_i + k_i/\mu_i\right) = \vec{\lambda}^T U(N) = \lambda\,\vec{\alpha}^T U(N) < L$$

for system stability. Starting from this point, we only consider
non-blocking (work-conserving) policies. We approximate the
capacity region with respect to $N$ by the convex set

$$C(N) = \{\vec{\lambda} : \vec{\lambda}^T U(N) < L\},$$

and the capacity for a given composition of requests
$\vec{\alpha}$ is

$$C_{\vec{\alpha}}(N) = L/\vec{\alpha}^T U(N).$$
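The stability condition and the per-composition capacity translate directly into code. The sketch below uses plain Python lists for the vectors; the function names and the two-class parameter values in the usage example are hypothetical.

```python
def usage_vector(N, K, deltas, mus):
    """U(N) with entries u_i(n_i) = n_i*delta_i + k_i/mu_i."""
    return [n * d + k / m for n, k, d, m in zip(N, K, deltas, mus)]


def is_supportable(rates, N, K, deltas, mus, L):
    """True if the rate vector lies in C(N), i.e., rates^T U(N) < L."""
    U = usage_vector(N, K, deltas, mus)
    return sum(r * u for r, u in zip(rates, U)) < L


def capacity_for_composition(alpha, N, K, deltas, mus, L):
    """C_alpha(N) = L / (alpha^T U(N)): the largest supportable total rate."""
    U = usage_vector(N, K, deltas, mus)
    return L / sum(a * u for a, u in zip(alpha, U))


if __name__ == "__main__":
    # Hypothetical two-class example (class 0, class 1), k = 3 for both, L = 16.
    K, deltas, mus = [3, 3], [0.06, 0.11], [12.5, 1 / 0.03]
    for N in ([3, 3], [6, 3], [6, 6]):
        print(N, round(capacity_for_composition([0.5, 0.5], N, K, deltas, mus, 16), 1))
```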



Obviously, the capacity region is maximized when there is no
coding, i.e., $N = K = [k_1; \cdots; k_m]$. We call $C(K)$ the full
capacity region.

Similar to our previous discussion for the single-class
scenario, we use an M/G/1 queue approximation to model the
request queue and use the Pollaczek-Khinchin formula to estimate
the queueing delay. For a given composition vector $\vec{\alpha}$,
the service time of this M/G/1 queue is modeled by some random
variable $X$ whose mean is

$$\mathbb{E}[X] = 1/C_{\vec{\alpha}}(N) = \vec{\alpha}^T U/L,$$

for a reason similar to that for non-blocking schemes in the
single-class scenario. For the second moment, one possibility is
to generalize the Erlang approximation for the single-class case
and consider $X$ to be a mixture of different Erlang random
variables: with probability $\alpha_i$, it follows an Erlang
distribution with parameter $n_i$. While this is doable, it leads
to a complicated expression, and we believe it would provide only
marginal extra insight for the purpose of scheduler design. For
this reason, we make the simple and rough assumption that
$\mathbb{E}[X^2] = \beta\,\mathbb{E}^2[X]$ for some constant
$\beta > 0$ independent of $\vec{\alpha}$ and $N$. Then
the queueing delay is approximated by the Pollaczek-Khinchin
formula

$$D_q(N,\vec{\lambda}) = \frac{\lambda\,\mathbb{E}[X^2]}{2(1-\lambda\,\mathbb{E}[X])} = \frac{\beta\lambda\,\mathbb{E}^2[X]}{2(1-\lambda\,\mathbb{E}[X])} = \frac{\beta\lambda(\vec{\alpha}^T U)^2}{2L(L - \lambda\vec{\alpha}^T U)},$$

and the expected queue length is

$$Q(N,\vec{\lambda}) = \lambda D_q(N,\vec{\lambda}) = \frac{\beta\lambda^2(\vec{\alpha}^T U)^2}{2L(L - \lambda\vec{\alpha}^T U)} = \frac{\beta(\vec{\lambda}^T U)^2}{2L(L - \vec{\lambda}^T U)}.$$

We also approximate the service delay of each class $i$ by
$D_{s,i}(n_i) = \Delta_i + \sum_{j=n_i-k_i+1}^{n_i} \frac{1}{j\mu_i}$.
Noticing that requests of all classes have the same expected
queueing delay, we formulate the following optimization problem
of finding the best fixed FEC scheme that minimizes the average
delay:

$$\min_{N} \; \frac{\beta\lambda(\vec{\alpha}^T U)^2}{2L(L - \lambda\vec{\alpha}^T U)} + \sum_{i=1}^{m} \alpha_i D_{s,i}(n_i) \tag{5}$$

$$\text{s.t.} \quad \lambda\vec{\alpha}^T U(N) < L \quad \text{and} \quad n_i > k_i - 1 \;\; \forall\, i = 1, \cdots, m.$$

It is worth pointing out that we are only interested in the
structure of the optimal solution, which we use for our
scheduler design, rather than in an accurate expression of the
solution. For the following discussion, we relax the integer
requirement on the $n_i$'s and allow $n_i$ to be any value
$> k_i - 1$.

It is easy to verify that when $\alpha_i > 0 \;\forall i$, the
objective of the optimization problem in Eq. (5) is strictly convex
in $N$. As a result, we can denote by $N^*(\vec{\lambda})$ the unique
optimal solution for rate vector $\vec{\lambda}$. Also let
$H(N) \triangleq \{\vec{\lambda} \mid N = N^*(\vec{\lambda})\}$. In other
words, $H(N)$ is the set of all rate vectors for which $N$ is
the optimal choice of code lengths. In the case that $N$ is not
optimal for any rate vector, $H(N) = \{\}$. We say a code vector
$N$ is good if and only if $H(N) \neq \{\}$. Theorem 1 below is the
main result of our analysis.

Theorem 1: Any good code vector $N$ should have the
structure

$$S_1 = S_2 = \cdots = S_m, \tag{6}$$

where $S_i = \frac{1}{\Delta_i\mu_i}\sum_{j=0}^{k_i-1} \frac{1}{(n_i - j)^2}$.
For any such good code vector $N$, $H(N)$ is the part of the
hyperplane defined by $\vec{\lambda}^T U(N) = const(N)$ within the
positive orthant ($\lambda_i > 0 \;\forall i$), where $const(N)$ is
solely determined by $N$. As a result, while using the optimal $N$
at rates $\vec{\lambda} \in H(N)$, the corresponding queue length is
a function of only $N$:

$$Q_{opt}(N) = \frac{\beta\, const(N)^2}{2L(L - const(N))}.$$

Proof: See Appendix A.

Fig. 9. Best combination of code lengths and queue length,
$k_{read} = k_{write} = 3$. Left: $n_{read}$. Right: $n_{write}$.

Fig. 10. Best combination of code lengths and queue length,
$k_{read} = 3$ and $k_{write} = 2$. Left: $n_{read}$. Right: $n_{write}$.
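For small $m$, the fixed-code optimization in Eq. (5) can be explored by brute force over integer code vectors. The sketch below is only an illustration of the objective: the function names, the value of $\beta$, and the two-class delay parameters in the usage example are all assumptions made for this example.

```python
from itertools import product


def avg_delay(N, rates, K, deltas, mus, L, beta):
    """Objective of Eq. (5): common queueing delay plus weighted service delays."""
    lam = sum(rates)
    alpha = [r / lam for r in rates]
    au = sum(a * (n * d + k / m)
             for a, n, k, d, m in zip(alpha, N, K, deltas, mus))  # alpha^T U(N)
    if lam * au >= L:
        return float("inf")  # outside the approximate capacity region C(N)
    d_q = beta * lam * au ** 2 / (2 * L * (L - lam * au))
    d_s = sum(a * (d + sum(1.0 / (j * m) for j in range(n - k + 1, n + 1)))
              for a, n, k, d, m in zip(alpha, N, K, deltas, mus))
    return d_q + d_s


def best_fixed_code(rates, K, n_max, deltas, mus, L, beta):
    """Integer code vector minimizing avg_delay, found by exhaustive search."""
    candidates = product(*(range(k, nm + 1) for k, nm in zip(K, n_max)))
    return min(candidates,
               key=lambda N: avg_delay(N, rates, K, deltas, mus, L, beta))


if __name__ == "__main__":
    # Hypothetical two-class parameters; beta = 1.3 is an assumed constant.
    K, n_max = [3, 3], [6, 6]
    deltas, mus = [0.06, 0.11], [12.5, 1 / 0.03]
    for total in (1, 10, 20, 30, 36):
        rates = [total / 2, total / 2]
        print(total, best_fixed_code(rates, K, n_max, deltas, mus, 16, 1.3))
```

The exhaustive search grows exponentially with the number of classes, so it serves only as a reference for checking the structure that Theorem 1 predicts on small instances.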

For any pair of good code vectors $N \neq N'$, define the ordering
$\prec$ such that $N \prec N'$ if and only if $n_i < n'_i \;\forall i$,
and similarly for $\succ$. Also, for two sets of rate vectors $H(N)$
and $H(N')$, we say that $H(N) \prec H(N')$ if and only if $H(N)$ is
completely contained in the convex hull defined by $H(N')$ and the
origin.

Corollary 1: The set of all good code vectors is totally
ordered with respect to $\prec$. Moreover, the corresponding rate
vector set $H(N)$ and queue length $Q_{opt}(N)$ are both decreasing
functions of $N$. In other words,

$$\forall N \succ N', \quad H(N) \prec H(N') \;\text{ and }\; Q_{opt}(N) < Q_{opt}(N').$$

Proof: See Appendix A. ■
An intuitive interpretation of Theorem 1 and Corollary 1
is as follows. The full capacity region $C(K)$ is "sliced" into
layers by the hyperplanes $H(N)$. One single (fractional) code
vector is optimal for all rates within each layer. When the
optimal code vector is used, it produces an identical expected
queue length throughout the whole layer. The layer furthest
away from the origin (heavy workload) corresponds to the
largest expected queue length. Since the arrival rates there are
so close to full capacity, any redundancy is detrimental; hence
no coding should be used. As we move to layers closer to
the origin (light workload), the corresponding expected queue
length decreases and we can afford to increase the amount of
redundancy by using coding.

Remember that Theorem 1 and Corollary 1 are derived
based on a non-integer relaxation of code lengths, as well as
approximations of the queueing and service delays, especially
the assumption that $\mathbb{E}[X^2] = \beta\,\mathbb{E}^2[X]$. To
verify the validity of these results in practice, we perform
simulations with $m = 2$ classes of requests, namely read and
write, with $k_{read} = k_{write} = 3$,
$n_{read}, n_{write} \in \{3, 4, 5, 6\}$ and $L = 16$,
using traces we collected in March and a chunk size of 1MB. We
run simulations at different rate vectors
$(\lambda_{read}, \lambda_{write})$, with $\lambda_{read}$ and
$\lambda_{write}$ varying from 0.05× to 1× of $C_{read}$
and $C_{write}$ respectively, where
$C_{read} = L/\left(k_{read}(\Delta_{read} + 1/\mu_{read})\right)$ and
$C_{write} = L/\left(k_{write}(\Delta_{write} + 1/\mu_{write})\right)$
are the maximum arrival rates of read and write requests the
system can support. At each rate vector, we run simulations for
all 4×4 possible combinations of $(n_{read}, n_{write})$, find the
combination that produces the minimum total delay, and record
the corresponding average queue length.

We plot the simulation results in Fig. 9. The x and y axes are
the arrival rates of read ($\lambda_{read}$) and write
($\lambda_{write}$) requests, respectively. The full capacity region
is the lower-left half below the diagonal dark red line. Beyond
this line (top-right) the queue is unstable. Each block in these
figures represents one rate vector, and the color of a block
represents the value of $n_{read}$ (left) or $n_{write}$ (right)
that results in the smallest total delay among the simulations.
The lightest color represents a code length of 6 and the darkest
represents 3. We also plot contours of queue length levels as
colored curves, in which points on the same contour have the same
average queue length (blue meaning small and orange meaning large).
As we can see, except for a small number of blocks, the
rate region is generally divided into 4 layers. Starting from
6 coded blocks in the layer closest to the origin, the number
of coded blocks decreases as we move away from the origin
and eventually becomes 3 in the outermost layer. The small
number of exceptional blocks near the boundaries are due to
the integer constraint on code lengths as well as randomness
in our simulations. Moreover, both the boundaries of these
layers and the contours of queue lengths are roughly straight
lines, and the boundaries of layers in general are aligned with
the contours of queue lengths at the corresponding arrival
rates (some are not shown in the figures). This validates our
predictions from Theorem 1 and Corollary 1 that (1) $H(N)$ is
a hyperplane (which is a line in the 2-dimensional space); (2)
$Q_{opt}(N)$ is constant within $H(N)$; and (3) both $H(N)$ and
$Q_{opt}(N)$ are decreasing functions of $N$. Another observation
is that as arrival rates increase, $n_{write}$ drops earlier than
$n_{read}$ does. This is because, according to our trace, while
reads and writes of 1MB chunks have similar mean task delays (both
around 140 ms), $\Delta_{write}$ is much larger than $\Delta_{read}$
(114 ms vs. 61 ms), and, as discussed in Section IV, the queueing
delay starts to dominate at lower utilization with larger $\Delta$.
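The effect of the larger constant delay for writes can be made concrete with the numbers quoted above. The sketch below treats the mean task delay as exactly 140 ms for both classes (the text says "around 140 ms"), so that $1/\mu$ is taken as 140 ms minus $\Delta$; this, the function name, and the use of the non-blocking capacity estimate $L/u(n)$ are assumptions of the example.

```python
MEAN_TASK_DELAY = 0.140                    # seconds; approximate, from the text
DELTA = {"read": 0.061, "write": 0.114}    # constant delay per class, seconds
K = 3                                      # chunks per request


def relative_capacity(n, delta, mean=MEAN_TASK_DELAY, k=K):
    """Capacity with an (n, k) code relative to no coding: u(k) / u(n)."""
    spread = mean - delta                  # 1/mu under the assumed mean
    return (k * delta + k * spread) / (n * delta + k * spread)


if __name__ == "__main__":
    for name, delta in DELTA.items():
        print(name, [round(relative_capacity(n, delta), 2) for n in range(K, 7)])
```

Under these assumptions a (6, 3) code retains about 70% of the no-coding capacity for reads but only about 55% for writes, which is consistent with writes abandoning redundancy at lower arrival rates.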

It appears in Fig. 9 that all contours of queue length are
roughly parallel to the boundary of the full capacity region
(the diagonal dark red line), which may give the impression that
$H(N)$ is parallel to the full capacity boundary. We would
point out that this is just a coincidence. In Fig. 10 we plot the
results for $k_{read} = 3$, $k_{write} = 2$, and
$n_{write} \in \{2, 3, 4, 5\}$. It is clear in this case that the
contours are not parallel to the full capacity boundary,
especially for low arrival rates.



B. MBAFEC- Multi-class Backlog-based Adaptive FEC 

An important implication of Corollary 1 is that there is a
one-to-one mapping from $Q_{opt}$ to the corresponding good
code vector $N$, since the set of good code vectors is totally
ordered and $Q_{opt}$ is a strictly decreasing function of the good
code vectors. Roughly speaking, the larger $Q_{opt}$ is, the smaller
the (good) code vector should be. This suggests that generalizing
the single-class scheduler BAFEC to accommodate multiple
classes of requests is plausible.

A natural and intuitive way of generalizing BAFEC is to
first enumerate the set of good code vectors using the structure
provided by Eq. (6), then sort these code vectors and solve for
the corresponding backlog thresholds for every pair of
consecutive code vectors, as we did for the single-class
scheduler BAFEC. Finally, depending on which range between the
thresholds the backlog size falls into, we pick the
corresponding $N$. However, this approach is not quite
feasible when the number of classes $m$ is large, mainly due
to the integer requirement on $N$. Notice that Eq. (6) can be
converted into a polynomial equation in $n_i$ and $n_j$, of
degree $2k_i$ and $2k_j$ respectively. A straightforward way
of finding the set of good codes is to first pick the code
length for a certain class, say $n_1$ without loss of
generality, to be an integer under consideration, then solve
Eq. (6) numerically for the corresponding code lengths of the
other classes. However, the solutions obtained by doing this are
not necessarily integers. In fact, they will most likely be
non-integers unless the values of the $\Delta_i, \mu_i, k_i$'s
happen to pair up perfectly. So for every such fractional
solution of $N$ (except for $n_1$), we need to decide which of
$\lfloor n_i \rfloor$ and $\lceil n_i \rceil$ to pick, for all
$i \neq 1$. There is no obvious way to solve this other
than enumerating all $2^{m-1}$ potential solutions, computing the
expected delays and picking the best one. So the computational
complexity is exponential in $m$ for each integer value of
$n_1$. Such exponential complexity may be affordable for static
algorithms, which assume the statistics of task delays ($\Delta$
and $\mu$) to be fixed. But in reality the delay statistics of
cloud storage systems vary over time and need to be updated
regularly in order to obtain the best performance. More
importantly, stale delay statistics can be dangerous: if they are
too optimistic compared to reality, the scheduling algorithm will
tend to allocate more tasks per request than it should, which
results in a large backlog and queueing delay. In such cases, the
exponential complexity is prohibitively expensive.

In fact, the exponential complexity of computing the backlog
thresholds can be avoided. The key is to observe that $Q_{opt}$
is also a decreasing function of each individual $n_i$, and there
is also a one-to-one mapping from $Q_{opt}$ to $n_i$, assuming the
other classes are using the corresponding optimal code lengths.
So instead of adapting $N$ as a whole, adaptation can be done
for each $n_i$ separately. Instead of computing one set of
$\sum_{i=1}^{m}(n_i^{max} - k_i)$ backlog thresholds across which a
transition in the code vector $N$ occurs, we compute one smaller
set of $n_i^{max} - k_i$ thresholds for each class $i$ individually,
across which a transition in only $n_i$ occurs. Here $n_i^{max}$
denotes the maximum number of tasks allowed for a class-$i$
request. These two approaches should produce the same set of
thresholds, but the separated approach avoids the combinatorial
problem of enumerating the set of good code vectors in the first
place. Denoting $\{Q_{i,k_i}, \cdots, Q_{i,n_i^{max}}\}$ as the set
of thresholds computed for class $i$, the pseudo-code for the
MBAFEC scheduler we develop using the separate approach is as
follows:



MBAFEC (Multi-class Backlog-based Adaptive FEC)

Do the following for every request $r$:

1: $Q \leftarrow$ backlog size upon arrival of request $r$.

2: $i \leftarrow$ class that request $r$ belongs to.

3: Find $n$ such that $Q \in [0, Q_{i,n})$ for $n = n_i^{max}$, or
   $Q \in [Q_{i,n}, Q_{i,n-1})$, or $Q \in [Q_{i,n}, \infty)$ for $n = k_i$.

4: Serve request $r$ with an $(n, k_i)$ code when it becomes HoL.
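The per-class lookup can be sketched as below, with the thresholds supplied as precomputed input. The sketch assumes an indexing in which `thresholds[i][n]`, for $n = k_i, \cdots, n_i^{max}-1$, is the backlog at which class $i$ switches between code lengths $n$ and $n+1$, and that these values decrease in $n$; the function name and the threshold values in the example are made up for illustration.

```python
def mbafec_pick(backlog, cls, thresholds, k, n_max):
    """Code length for a request of class `cls` given the total backlog size.

    thresholds[cls][n] is the crossover backlog between code lengths n and n + 1
    for n = k[cls], ..., n_max[cls] - 1, assumed to be decreasing in n.
    """
    for n in range(k[cls], n_max[cls]):
        if backlog >= thresholds[cls][n]:
            return n
    return n_max[cls]


if __name__ == "__main__":
    # Hypothetical thresholds for two classes sharing one request queue.
    k = {"read": 3, "write": 3}
    n_max = {"read": 6, "write": 6}
    thresholds = {
        "read": {3: 1.5, 4: 0.85, 5: 0.54},
        "write": {3: 0.9, 4: 0.4, 5: 0.2},
    }
    for q in (0.1, 0.5, 1.0, 2.0):
        print(q, mbafec_pick(q, "read", thresholds, k, n_max),
              mbafec_pick(q, "write", thresholds, k, n_max))
```

Both classes observe the same backlog, but each applies its own thresholds, so the same queue state can yield different code lengths for reads and writes.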



To compute the set of thresholds for each class, recall that
$Q_{opt}(N)$ stays fixed for all $\vec{\lambda} \in H(N)$ according
to Theorem 1. So it suffices to consider rate vectors along a
certain direction specified by a fixed composition vector
$\vec{\alpha}$ and find the crossover backlog sizes along that
direction. In particular, for class $i$, we consider the direction
along the $i$-th axis. In other words, we consider the
class-$i$-only arrival case with $\alpha_i = 1$ and
$\alpha_j = 0 \;\forall j \neq i$. In the example of Fig. 9, this
is equivalent to finding the intersections of the boundaries of
the layers with the x axis (read-only arrivals) in the left plot
for the thresholds of read requests, and finding the intersections
with the y axis (write-only arrivals) for write requests. Further,
noticing that MBAFEC behaves identically to BAFEC when
arrivals are single-class, these intersections with the $i$-th axis
can be computed using Eq. (4) with parameters
$\Delta_i, \mu_i, k_i$, just as we do for BAFEC in the previous
section.

C. Performance Evaluation of MBAFEC 

For performance evaluation, we perform simulations with
$m = 2$ classes of requests, namely read and write, with
$k_{read} = k_{write} = 3$, $n_{read}, n_{write} \in \{3, 4, 5, 6\}$
and $L = 16$, using traces we collected in March and a chunk size
of 1MB. We simulated three scenarios: read heavy
($\alpha_{read} = 0.9$), balanced ($\alpha_{read} = 0.5$), and
write heavy ($\alpha_{read} = 0.1$). We also extended Greedy to
accommodate multiple classes: each class-$i$ request uses a
$(\min(l, n_i^{max}), k_i)$ or a $(k_i, k_i)$ code, depending
on the number of idle threads $l$ upon arrival.

Fig. 11 illustrates the delay performance of MBAFEC and
Greedy. We also run simulations with fixed FEC schemes with
all 16 combinations of code lengths at every arrival rate of
each scenario, and use as baselines the best average delays
($\alpha_{read}D_{read,Y} + \alpha_{write}D_{write,Y}$ with
$Y$ = mean, 99% and 99.9%, where $D_{read,Y}$ and $D_{write,Y}$
represent the mean, 99% and 99.9% delays for read and write
requests), the best delays (mean, 99% and 99.9%) for read
requests, and the best delays (mean, 99% and 99.9%) for write
requests. We want to point out here that the combinations of code
lengths that result in the best average delay, read delay, and
write delay are not necessarily the same. We observe in our
simulations that the combination that results in the best read
delay usually uses a large code length for read requests and the
minimum code length for write requests,
$n_{write} = k_{write}$, which results in high write delay. The
opposite is observed for codes that produce the best write
delay. The combination that produces the best average delay
is usually in between. So in these figures, we are comparing 9
delay metrics of one adaptive scheme, MBAFEC (or Greedy),
against multiple fixed FEC schemes, each of which excels in
one particular delay metric.

Fig. 11. Delay performance from simulations



In the left column of FigHT] we plot the average delays. 
Similar to the results for the single class case, both MBAFEC 
and Greedy perform well and achieves roughly the same 
average mean and 99% delays as the best fixed FEC schemes 
throughout the full capacity region. MBAFEC also achieves 
the lower envelop of fixed FEC schemes in terms of average 
99.9% delay and outperforms Greedy. More interesting are 
the middle and right columns, in which we plot the read and 
write delays of MBAFEC and Greedy, normalized by the best 
corresponding delays with fixed FEC. MBAFEC and Greedy 
perform similarly in terms of mean delays and both stay within 
1.5 X of the best mean delays with fixed FEC. Remember 
this comparison is made against the fixed FEC scheme that 
produces the best mean read or write delay, which is different 
from the one that produces the best average delay. For 99% 
delays, MBAFEC and Greedy again perform similarly, although 
MBAFEC outperforms Greedy in read delay by around 20-50% 
at low to medium arrival rates (20% to 40% utilization 
level) and performs slightly worse than Greedy, by 10-20%, 
in write delay at medium to high arrival rates (60% to 90% 
utilization level). These two adaptive schemes perform quite 
differently in terms of 99.9% delays. For 99.9% write delay, 
MBAFEC and Greedy are similar and stay within 1.5x of the 
best fixed FEC for most arrival rates in all three scenarios. 
On the other hand, MBAFEC consistently and significantly 
outperforms Greedy in terms of 99.9% read delay in all three 
scenarios. MBAFEC stays within 1.5x, 1.8x and 2.4x of the best 
delay from fixed FEC in the read heavy, balanced and write heavy 
scenarios respectively, while Greedy can perform as badly as 
2.9x, 3.7x and 4.2x in each scenario. There are two reasons 
for this difference in performance between read and write requests. 
First, in our trace read operations have a much larger delay 
spread than write operations. As a result, read requests 
benefit significantly from the reduction in service delay that 
parallelism with an appropriately chosen code length provides, 
while write requests cannot benefit much due to their smaller 
delay spread. More importantly, Greedy is "class-oblivious": 
it does not make use of the difference in delay statistics of 
different classes of requests in deciding the code length for each class. 

To better understand how MBAFEC and Greedy behave 
differently, we plot the code composition (the fractions of 
requests served by different code lengths) of read and write 
requests using MBAFEC and Greedy from 10% to 100% 
utilization levels. Fig. 12 shows the code compositions for the 
balanced arrival scenario (plots for read/write heavy scenarios 
are similar). At each utilization level, the four bars represent 
the code compositions of read requests with MBAFEC, write 
requests with MBAFEC, read requests with Greedy, and write 
requests with Greedy, from left to right. For each bar, the colors 
represent the fraction of requests served with code length 
3, 4, 5 and 6, from bottom to top. Generally speaking, both 
schemes behave as expected: at low utilization, both schemes 
mostly use code length 6 since service delay dominates; as 
utilization increases, both become less aggressive and increase 
the fraction of requests served by smaller code lengths; at 
very high utilization, both reduce to no coding for both read 
and write requests (n_read = k_read and n_write = k_write). The 
major difference we observe between MBAFEC and Greedy is 
that the code compositions for read and write requests differ 
significantly with MBAFEC except at very low and very 
high utilization levels, while they are almost identical with 
Greedy at all utilization levels. Remember that in Greedy, 
the code length used to serve a request is determined by the 
number of idle threads upon arrival of the request and the 
range of code lengths allowed to serve the request. Since 
we assume Poisson arrivals, both read and write requests 
should statistically observe the same distribution of number 
of idle threads. Also, because both read and write requests 
have the same range of code lengths in our simulations, they 
end up with the same code composition. If different types 
of requests have different ranges of code lengths, the code 
compositions will be slightly different for the edge cases (not 
enough idle threads or too many idle threads). On the other 
hand, MBAFEC treats read and write requests very differently, 
given that read and write operations have very different delay 
distributions. For read requests, since the delay of read operations 
has a small fixed component (Δ_read) and a large exponential 
tail (μ_read), the overhead in queueing delay of parallelism is 
much smaller than the benefit from service delay. So MBAFEC 
is more aggressive in using large code lengths (n_read > 3). For 
write requests, since write operations have a large fixed delay 
component, MBAFEC is more conservative. For medium to 
high utilization levels, MBAFEC is even more conservative 
than Greedy for write requests (MBAFEC serves fewer write 
requests with n_write > 5 than Greedy does at 80% to 100% 
utilization). 
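
The class-oblivious behavior of Greedy described above can be illustrated with a minimal sketch. The function name `greedy_code_length` and the fallback to the minimum code length when too few threads are idle are assumptions made for illustration, not the exact rule used in the simulations.

```python
def greedy_code_length(idle_threads: int, n_min: int, n_max: int) -> int:
    """Pick a code length from the idle-thread count alone (class-oblivious).

    Assumption: the request uses as many idle threads as its allowed range
    permits, and falls back to n_min when fewer than n_min threads are idle.
    """
    if n_min > n_max:
        raise ValueError("n_min must not exceed n_max")
    return max(n_min, min(n_max, idle_threads))


# Read and write requests share the range [3, 6] here, as in the simulations,
# so the rule returns the same code length for both classes.
for idle in (1, 4, 9):
    print(idle, greedy_code_length(idle, n_min=3, n_max=6))
```

Because the rule takes no class-specific delay statistics as input, two classes with the same range and the same idle-thread distribution necessarily end up with the same code composition.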

We also observe that at all utilization levels, Greedy serves 
most requests with either the maximum or minimum value of n 
while MBAFEC serves a much larger fraction of requests with 
medium values of n. This all-or-nothing behavior of Greedy 
is the main reason for its poor performance at high percentile 
delays, since the service delay distribution of simple chunking 
without coding (n = k > 1) is only slightly better than doing 
nothing (n = k = 1). 
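
To illustrate why redundancy (n > k) shortens service delay, the sketch below assumes each task takes a fixed delay plus an exponential tail, matching the description of read operations above. The function name and the parameter values are hypothetical. The per-chunk parameters are held fixed, so only comparisons across n for the same k are meaningful here.

```python
def mean_service_delay(n: int, k: int, delta: float, mu: float) -> float:
    """Mean time until k of n parallel tasks finish.

    Each task is assumed to take delta + Exp(mu); the k-th smallest of n
    i.i.d. exponentials has mean (1/mu) * sum_{j=0}^{k-1} 1/(n - j).
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return delta + sum(1.0 / (n - j) for j in range(k)) / mu


# Hypothetical parameters: small fixed part, large exponential tail.
delta, mu, k = 0.02, 10.0, 3
for n in (3, 4, 5, 6):
    print(n, round(mean_service_delay(n, k, delta, mu), 4))
```

Under these assumptions the mean service delay falls as n grows for fixed k, which is the benefit that a request served with n = k gives up.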

VI. Related Work 

FEC in connection with multiple paths and/or multiple 
servers is a well investigated topic in the literature [6], [7], 
[8], [9]. However, very little attention has been devoted to 
the queueing delays. FEC in the context of network coding 
or coded scheduling has also been a popular topic from the 
perspectives of throughput (or network utility) maximization 
and throughput vs. service delay tradeoffs [10], [11], [12], 
[13]. Although some incorporate queueing delay analysis, the 
treatment is largely for broadcast wireless channels with quite 
different system characteristics and constraints. 

Fig. 12. Composition of code lengths (50 percent read requests; x-axis: 
utilization level). Each group is ordered as [MBAFEC 
read, MBAFEC write, Greedy read, Greedy write]. 

FEC has also been extensively studied in the context of 
distributed storage from the standpoints of high durability and 
availability while attaining high storage efficiency [14], [15], 
[16]. A very recent work [4] proposes a novel storage node 
scheduling policy and to the best of our knowledge is also the 
first work that investigates the queueing delays in a tractable 
fashion where different storage nodes host distinct coded 
blocks. The system model and the targeted problem in [4] 
are however quite different: they require a fixed rate code 
and scheduling exactly k = 2 servers. Thus, the service delay 
vs. queueing delay tradeoff due to FEC is completely absent. 
Furthermore, each server is assumed to have an exponential 
service time, which is not justified by our measurement results. 
The presented solution is neither throughput nor delay optimal. 

Another set of works that is closely related to our work 
looks directly into the delay performance of storage clouds 
[1], [17]. The measurement results and interim conclusions in 
[1] on Amazon S3 motivated our work. The paper presents the 
throughput-delay tradeoffs in service times as object sizes vary. 
The authors establish the skewness and long tails of the service 
times and recommend canceling long pending jobs and sending 
a fresh request instead. 
Although the suggestion would work well for long tails, 
it would not lead to much delay improvement below the 99th 
percentile. [17], on the other hand, focuses more closely on the 
throughput-service delay tradeoff and devises a data batching 
scheme. Based on the observed congestion, the authors increase 
or reduce the batching size. Thus, at high congestion, a larger 
batch size is used to improve the throughput while at low 
congestion a smaller batch size is adopted to reduce the delay. 
The chunk size in our work is similar to the batch size 
considered in [17], and how to combine these complementary 
ideas remains future work. 

VII. Conclusion and Future Directions 

We presented novel solutions that combine parallel thread 
scheduling and FEC for accessing data stored in public clouds 
substantially faster in the sense of mean, 90th percentile, 99th 
and higher percentile latencies. The solutions can be applied 
to other distributed data storage technologies that exhibit high 
delay variations for object or block storage. 

In the analysis of the problem, we admitted a mixed traffic 
load with multiple classes of file read/write requests, but the 
chunk and file sizes of each class were predetermined and 
fixed. We are currently working on analyzing and realizing 
adjustable chunk sizes within each class. The proposed backlog 
based schemes depend on this analysis to compute 
the approximately optimal thresholds. The greedy solution, 
however, is generic and can pick the best chunking and FEC 
combination allowed by the available number of threads. 

In our work, we neglected the dollar cost of using 
redundant requests; e.g., Amazon S3 charges $0.01 per 1,000 
requests for PUT, COPY, POST, or LIST requests and $0.01 
per 10,000 requests for GET and all other requests. For now, 
by limiting the code rate and level of chunking, we put upper 
bounds on these costs in our work. Since not all parts of data 
are delay sensitive, such costs can be managed by applying 
our techniques on a smaller fraction of the load (e.g., initial 
segments of a video file). Extensions to capture the cloud 
pricing in the problem formulation and devise scheduling 
schemes accordingly are part of our ongoing work. 
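
As a rough illustration of how the request charges quoted above scale with the code length, the sketch below multiplies the per-request prices by the number of tasks issued per object. The function name and the assumption that every one of the n tasks is billed as one request (including tasks cancelled after k complete) are ours, not a statement of the provider's billing rules.

```python
PUT_PRICE_PER_REQUEST = 0.01 / 1_000    # PUT, COPY, POST, LIST (from the text)
GET_PRICE_PER_REQUEST = 0.01 / 10_000   # GET and all other requests


def request_cost(num_objects: int, n: int, is_write: bool) -> float:
    """Request charges in dollars when every object operation issues n tasks.

    Assumption: each of the n coded chunks costs one billable request,
    including tasks that are cancelled after k of them complete.
    """
    price = PUT_PRICE_PER_REQUEST if is_write else GET_PRICE_PER_REQUEST
    return num_objects * n * price


# One million object reads with n = 3 versus n = 6.
for n in (3, 6):
    print(n, request_cost(1_000_000, n, is_write=False))
```

Under this assumption the request charge grows linearly in n, so bounding the code length bounds the charge, and applying coding to only a fraction of the load scales the charge by that fraction.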



Appendix A 
Proofs of Theorem 1 and Corollary 1 

Proof: First observe that the first term of the objective 
approaches ∞ as λ^T U → L and the second term approaches 
∞ as n_i → k_i − 1. Since both terms are lower bounded by 0, 
it follows that the objective approaches ∞ at the boundary of 
the feasible region. Together with the fact that the objective is 
a strictly convex function of N, it follows that for any given 
feasible λ, the optimal solution N*(λ) is strictly inside the 
feasible region. Since the objective is differentiable, its partial 
derivative equals 0 only at N*. In other words, if for some N 
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equals to for all i, then N = N*{A). Here = -M*^ 
J2j'=o^ {m-j)'^ ■ condition is equivalent to 
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(7) 



Due to the uniqueness of the optimal solution, the other 
direction is also true: for any given good code vector N, if λ 
satisfies Eq. 7 then N = N*(λ), or equivalently λ ∈ H(N). 

An important property of good code vectors implied by Eq. 7 
is that all good code vectors line up on the curve specified by 



Given this, for any good N, denote π(N) = for any i, when 
Eq. 8 is satisfied. Then Eq. 7 can be rewritten as 

L 



A'U = L 



l + 7r(A^) 



= const{N). 



(9) 



(8) 



In other words, H(N) = {λ | λ^T U(N) = const(N)}. 

It is obvious that π(N) is strictly decreasing in n_i for n_i > k_i − 1, 
for all i. So π(·) is invertible and for any a > b in the range 
of π(·) we have π^{-1}(a) ≺ π^{-1}(b). This implies that the good 
code vectors are totally ordered in decreasing order of π(·). 

Consider any two good code vectors N ≻ N'. For any 
λ ∈ H(N), λ^T U(N) = const(N). Note that const(N) 
is a strictly increasing function of π(N), so it is a strictly 
decreasing function of N. Hence const(N) < const(N'), and 
we have λ^T U(N') < λ^T U(N) = const(N) < const(N'). 
The first inequality is due to the fact that both λ and U are 
> 0. Now we can conclude that any λ ∈ H(N) is strictly 
within the convex hull defined by H(N') and the origin. So 
H(N) < H(N'). 

It is easy to verify that Q_opt(N) is an increasing function 
of const(N). Since const(N) is a decreasing function of N, 
Q_opt(N) is a decreasing function of N. ■ 